Load the data from the dataset.
In this part, we focused the exploration on the geographical part. From the original dataset, we have both the ‘STATE’ attribute and ‘CITY’ attribute for individual and committee. So at the beginning of the analysis, we want to apply these data to our analysis.

cm <- read.csv(file="C:\\Users\\jiachen xu\\Desktop\\Columbia Course\\Data Visualization\\Homework\\project\\cm.txt", header = FALSE, sep="|")
cm_header <- read.csv(file="C:\\Users\\jiachen xu\\Desktop\\Columbia Course\\Data Visualization\\Homework\\project\\cm_header_file.csv", header = TRUE, sep=",")
colnames(cm) <- colnames(cm_header)
data <- read.csv(file="C:\\Users\\jiachen xu\\Desktop\\Columbia Course\\Data Visualization\\Homework\\project\\filtered_v2.csv", header = TRUE, sep=",")

data_use <- data[,c("CMTE_ID","NAME","CITY","STATE","TRANSACTION_AMT")]
cm_use <- cm[,c("CMTE_ID","CMTE_CITY","CMTE_ST","CMTE_TP","CMTE_PTY_AFFILIATION")]
all <- merge(data_use,cm_use,by = 'CMTE_ID')
library(ggplot2)
library(tidyr)
library(dplyr)
library(plotly)
  1. General Exploration Evaluate the data
library(plotly)
temp <- all[,c('CITY','STATE','TRANSACTION_AMT')]
temp <- temp %>% group_by(STATE,CITY) %>% mutate(sum = sum(TRANSACTION_AMT))
temp <- temp[!temp$CITY == "",]
temp <- temp[!temp$STATE == "",]

temp <- temp[order(temp$sum),]
cities <- tail(temp,10)
cities <- c('LAS VEGAS','ATLANTA','PHILADELPHIA','DALLAS','LOS ANGELES')
temp <- temp[temp$CITY %in% cities,]
temp <- as.data.frame(temp)

temp$CITY <- factor(temp$CITY, levels = cities)
temp$STATE <- factor(temp$STATE, levels = unique(temp$STATE))
temp$lab_city <- as.integer(temp$CITY)
temp$lab_state <- as.integer(temp$STATE)

p <- temp %>%   plot_ly(type = 'parcoords',
              line = list(color = 'blue'),
               dimensions = list(
               list(range = c(1,length(unique(temp$STATE))),
                    tickvals  = c(1:length(unique(temp$STATE))),
                    label = 'STATE', values = ~lab_state,
                    ticktext = unique(temp$STATE)),
               list(range = c(1,length(unique(temp$CITY))),
                    tickvals  = c(1:length(unique(temp$CITY))),
                    label = 'CITY', values = ~lab_city,
                    ticktext = unique(temp$CITY)),
               list(range = c(~min(sum),~max(sum)),
                    label = 'SUM', values = ~sum))
             )

p

In this section, we focused on exploring the dataset from the geographic aspect. Firstly, we visualized the geographical dataset to evaluate the data. Some results appealed our attention, a number of cities point to different states. At first, we thought this may because of some pure errors in the dataset. However, after some investigation, we found there do have a number of cities share the same name but locate in different states. Considering this situation, we decided only apply ‘STATE’ as an index to explore the data to avoid misunderstanding by applying ‘CITY’ data. (In this graph, to make the observation clearer, we only select 5 cities.)

temp <- all[,c('CMTE_ST','STATE','TRANSACTION_AMT','CMTE_PTY_AFFILIATION')]
temp <- temp[!temp$STATE == '',]
temp <- temp %>% group_by(STATE) %>% summarise(sum = sum(TRANSACTION_AMT))

ggplot(temp, aes(sum, reorder(`STATE`,sum))) + geom_point(size = 5) + xlab("The Sum of Transaction") + ylab("The States") + labs(title = "The sum of Transaction among States ")+ theme(plot.title = element_text( size= 20),axis.title.x = element_text(size= 15),axis.title.y = element_text(size= 15), axis.text = element_text(size = 10))

From this graph, we observed that when considering the total transaction amount, ‘CA’, ‘NY’ and ‘TX’ led all other states, have much higher transaction amount. And the transaction amount among different states differed a lot, comparing ‘CA’ and ‘FM’. (This graph contained more than 50 states because some abbreviation represents transactions from outside US.)

temp <- all[,c('CITY','STATE','TRANSACTION_AMT')]
temp <- temp %>% group_by(STATE,CITY) %>% mutate(sum = sum(TRANSACTION_AMT),count = n())
temp <- temp[!temp$CITY == "",]
temp <- temp[!temp$STATE == "",]

temp <- temp[!(duplicated(temp$CITY) & duplicated(temp$STATE)),]
ggplot(temp, aes( count,sum )) + geom_point(size = 5) + xlab("The Number of Transaction") + ylab("The Sum of Transactions") + labs(title = "The Reltaionship between the Sum and the Number of Transaction among States ")+ theme(plot.title = element_text( size= 20),axis.title.x = element_text(size= 15),axis.title.y = element_text(size= 15), axis.text = element_text(size = 10))

According to this graph, we found that for most states, the sum of transactions was a positive correlation with the number of transactions. However, only one states have relatively larger number transactions but the lower amount of transactions, which violated this tendency a bit.

  1. Committee Type and States
    The intention of this graph was to detect whether there has a clear preference for committee types among different states. The data set including 16 different types of the committees in total. Of course, most of them are PAC(Political Action Committee). However, the data also includes some other types of the committee, like Campaign committee for the candidate for Senate, Campaign committee for the candidate for President. We want to dig whether individuals from different states have different preferences among these kinds of the committee as well. We achieved this through drawing this sankeynetwork.
library(networkD3)
temp1 <- all[,c('CMTE_TP','STATE','TRANSACTION_AMT')]
temp1 <- temp1 %>% group_by(STATE, CMTE_TP) %>% summarise(sum = sum(TRANSACTION_AMT))

name_vec <- c(as.character(unique(temp1$STATE)), as.character(unique(temp1$CMTE_TP)))

nodes <- data.frame(name = name_vec, id = seq(0,length(name_vec)-1,1))

links <- temp1 %>%
  left_join(nodes, by = c('STATE' = 'name')) %>%
  rename(state_id = id) %>%
  left_join(nodes, by = c('CMTE_TP' = 'name')) %>%
  rename(cmte_tp_id = id)


sankeyNetwork(Links = links, Nodes = nodes, Source = 'state_id', Target = 'cmte_tp_id', 
              Value = 'sum', NodeID = 'name', fontSize = 10)%>% 
  htmlwidgets::prependContent(htmltools::tags$h1("The sum of Transactions to types of  Committe among different States")) 

The sum of Transactions to types of Committe among different States

From this graph, from right to left, we could found that the committee with type(“Q”), PAC - Qualified got the most transactions. Following by type(“Y”)(Party - Qualified) and type(“W”)(PAC with Non-Contribution Account - Qualified). Comparing, with the unqualified committee, (N - PAC Nonqualified, X - Party - Nonqualified, V - PAC with Non-Contribution Account - Nonqualified), we could observe that individuals are more likely to donate to that qualified commitee.

Observing the graph from right to left, first, we could found that, CA, followed by VY and DC, were the top three states with the most transaction amount. If further focused our exploration on these three states, we could found the difference between individuals among different states. To be more specific, individual from CA prefer more about committee type (Y and W). However, individuals from DC preferred more to the committee with type(“O”)(Super PACs).

  1. Parties and States
    Now, let’s focus on a more interesting part. There have 61 different committees in the original dataset in total. At first, we tried to include all of them into one graph. But we found the final performance was not that good. So we decided to focus on the two largest committees, democratic and Republican party, which are also the two biggest parties in the united states. This graph intended to detect whether individual transactions from this dataset could represent the state political preference.
library(geofacet)
library(ggthemes)
temp2 <- all[,c('CMTE_PTY_AFFILIATION','STATE','TRANSACTION_AMT')]
temp2 <- temp2[temp2$CMTE_PTY_AFFILIATION == 'DEM' |temp2$CMTE_PTY_AFFILIATION == 'REP',]
temp2 <- temp2 %>% group_by(STATE, CMTE_PTY_AFFILIATION) %>% summarise(sum = sum(TRANSACTION_AMT))

name_vec <- c(as.character(unique(temp2$STATE)), as.character(unique(temp2$CMTE_PTY_AFFILIATION)))

nodes <- data.frame(name = name_vec, id = 0:63)

group.colors <- c(DEM = "#333BFF", REP = "#FF0000")
ggplot(temp2, aes(CMTE_PTY_AFFILIATION, sum, fill = CMTE_PTY_AFFILIATION)) +
    geom_col() +
     coord_flip() +
     facet_geo(~ STATE,grid = "us_state_grid2",scales = "free_x") + ylab("Sum of the Transactions")+ xlab("Parties")+ labs(title = "Sum of Transaction to Different Parties in Different States", fill = "Committee") + theme(axis.text.x = element_blank(),plot.title = element_text( size= 20),axis.title.x = element_text(size= 15),axis.title.y = element_text(size= 15))+scale_fill_manual(values=group.colors)+ scale_colour_colorblind()

For easier comparison among states, we draw the geo facet plot. And considering the intention of this graph is to compare the preference of the party among states, the x scales for each state is different, so the variance on the total transactions would not influence the people’s analysis. Also, we tested this graph on CVD simulator website (http://www.color-blindness.com/coblis-color-blindness-simulator/), make sure, even colour vision deficiency person could easily separate the two colours we applied.

Based on this graph, we concluded that CA, NY, MA, IL made more donation to Democratic committee. On the contrary, TX, FL, LA, TN made more contributions to the Republican committee. Also, there have some other states did not have the clear tendency, like CO. The visualization results were quite consistent with our background information. Hence, we concluded that individual transactions from this dataset could represent the state political preference.

4.Committee and States
In this part, we want to detect whether the time pattern in different states has a slight difference. To make the comparison easier, we set scales = “free_y”.

temp3 <- data[,c("TRANSACTION_DT","STATE","TRANSACTION_AMT")]
temp3 <- temp3 %>% group_by(STATE, TRANSACTION_DT) %>% summarise(sum = sum(TRANSACTION_AMT))
temp3 <- temp3[!temp3$STATE == "",]
temp3$datetime <- as.Date(temp3$TRANSACTION_DT)
ggplot(temp3, aes(datetime, sum)) +
    geom_line(color = "steelblue") +
     facet_geo(~ STATE, grid = "us_state_grid2", scales = "free_y") +
     ylab("Transaction Amount") +
     xlab("Transaction Time(From 2017-01-03 -- 2017-12-31)") + labs( title = "Transaction Time Flow at Different States") + theme(plot.title = element_text( size= 35),axis.text.x=element_text(angle=45,hjust=1),axis.title.x = element_text(size= 25),axis.title.y = element_text(size= 15))

Firstly, we observed that some states have regular time pattern. For instance, ‘FL’,‘PA’, the transactions experience a jump every a few months. However, ‘CA’ and ‘WY’ have the jump of the transactions were more frequently than other states. On the other hand, if ignoring the abnormal high transaction, the transactions at ‘ND’ was flatter. Also, transactions at ‘LA’ were more concentrated at the first half year, which tendency could also be observed at ‘SC’. But at ‘HI’ the transactions were more frequently in the second half of the year. All of these could help us explore the slight difference of the time pattern among states.

  1. Transaction Flow and States
library(rnaturalearth)
library(rgdal)
library(geosphere)
library(standardize)
library(leaflet)
library(RColorBrewer)
countries <- ne_countries()

states <- ne_states(iso_a2 = 'US')
temp7 <- all[,c('CMTE_ST','STATE','TRANSACTION_AMT')]
temp7 <- temp7 %>% group_by(STATE,CMTE_ST) %>% summarise(sum = sum(TRANSACTION_AMT))

states_xy <- states@data %>% select(postal, longitude, latitude)
temp7 <- temp7 %>%
  left_join(states_xy, by = c('STATE' = 'postal')) %>%
  left_join(states_xy, by = c('CMTE_ST' = 'postal'))

temp7$longitude.y <- as.numeric(as.character(temp7$longitude.y))

temp7$latitude.y <- as.numeric(as.character(temp7$latitude.y))
temp7 <- na.omit(temp7)

temp7$sum_scaled<- scale(temp7$sum)
flows <- gcIntermediate(temp7[,4:5], temp7[,6:7], sp = TRUE, addStartEnd = TRUE)

flows$amt <- temp7$sum_scaled

flows$origins <- temp7$STATE

flows$destinations <- temp7$CMTE_ST

hover <- paste0(flows$origins, " to ", 
                flows$destinations, ': ', 
                as.character(flows$amt))

pal <- colorFactor(brewer.pal(4, 'Set2'), flows$origins)

leaflet() %>%
  addProviderTiles('CartoDB.Positron') %>%
  addPolylines(data = flows, weight = ~amt, label = hover, 
               group = ~origins, color = ~pal(origins)) %>%
  addLayersControl(overlayGroups = unique(flows$origins), position = 'topright',
                   options = layersControlOptions(collapsed = FALSE,autoZIndex = TRUE))

We draw this graph to detect the transaction flow among different states. The observation was quite interesting. Nearly all transaction started form CA moves to DC and MA. The same situation found at NY. For FL, although most of the transaction move to DC, a number of different other states received transactions from FL as well, which may mean that comparing with individual form DC and NY, individuals form FL have more diversity transactions.

Also comparing with DC and NY, which only have these two transactions states, TX have some transaction flow to DC but not clear flow to MA.
And there have some transactions from MA to CA. Overall, most of the transactions end at DC, following by NY.

(This graph may have difficulty in generating in RMD, hence we attached a separate link for this graph.)